AD-A210  056 




REPORT  DOCUMENTATION  PAGE 


1b. RESTRICTIVE MARKINGS


3. DISTRIBUTION/AVAILABILITY OF REPORT
Approved for public release;
distribution unlimited.


5. MONITORING ORGANIZATION REPORT NUMBER(S)




6a. NAME OF PERFORMING ORGANIZATION
6b. OFFICE SYMBOL (If applicable)
7a. NAME OF MONITORING ORGANIZATION

6c. ADDRESS (City, State, and ZIP Code)


8a. NAME OF FUNDING/SPONSORING ORGANIZATION


8c. ADDRESS (City, State, and ZIP Code)
Building 410
Bolling AFB, DC 20332-6448


11. TITLE (Include Security Classification)


12. PERSONAL AUTHOR(S)


7b. ADDRESS (City, State, and ZIP Code)
Building 410
Bolling AFB, DC 20332-6448


9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER


10. SOURCE OF FUNDING NUMBERS




15. PAGE COUNT


17. COSATI CODES
FIELD | GROUP | SUB-GROUP

18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number)


19. ABSTRACT (Continue on reverse if necessary and identify by block number)

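This research project was based on our belief that modern technology has substantially changed
the flavor of problems being presented to the statistician. Electronic instrumentation implies
an ability to acquire a large amount of high dimensional data very rapidly. While such
capabilities have existed for some time, the emergence of cheap RAM in the 1980's has given us the
ability to store and access that data in an active computer memory. We contend that this
represents a challenge for statisticians which is substantially different in kind. The majority
of existing methodology is focused on the univariate, iid random variable model. Even in the
circumstance that a multivariate model is allowed, it is usually assumed to be multivariate
normal. We contend, in addition, that while arbitrary sample size is frequently assumed, the
truth of the matter is that these techniques implicitly assume small to moderate sample sizes.
For example, a regression problem with 5 design variables and 1000 observations would represent
no problem for traditional techniques. By contrast, a regression problem with 40,000 design
variables and 8 million observations would. The reason is clear. In the former case the
emphasis is on statistical efficiency, which is the operational goal for most current statistical
technology. By contrast, in the latter case, emphasis must be clearly on computational
efficiency. The emphasis on parsimony in many contemporary books and papers is a further
reflection of the mind-set that implicitly focuses on small to moderate sample sizes, since
few parameters do not make sense in the context of very large sample sizes. Finally, we
note that the very fact of largeness in sample size implies that it is unlikely we would
see iid homogeneity.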


20. DISTRIBUTION/AVAILABILITY OF ABSTRACT
□ UNCLASSIFIED/UNLIMITED  □ SAME AS RPT.  □ DTIC USERS

21. ABSTRACT SECURITY CLASSIFICATION
UNCLASSIFIED


22b. TELEPHONE (Include Area Code)
(202) 767-

22c. OFFICE SYMBOL


DD Form 1473, JUN 86. Previous editions are obsolete. SECURITY CLASSIFICATION OF THIS PAGE



This research project was based on our belief that modern technology has substantially
changed the flavor of problems being presented to the statistician. Electronic instrumentation implies
an ability to acquire a large amount of high dimensional data very rapidly. While such capabilities
have existed for some time, the emergence of cheap RAM in the 1980's has given us the ability to
store and access that data in an active computer memory. We contend that this represents a challenge
for statisticians which is substantially different in kind. The majority of existing methodology is
focused on the univariate, iid random variable model. Even in the circumstance that a multivariate
model is allowed, it is usually assumed to be multivariate normal. We contend, in addition, that while
arbitrary sample size is frequently assumed, the truth of the matter is that these techniques implicitly
assume small to moderate sample sizes. For example, a regression problem with 5 design variables and
1000 observations would represent no problem for traditional techniques. By contrast, a regression
problem with 40,000 design variables and 8 million observations would. The reason is clear. In the
former case the emphasis is on statistical efficiency, which is the operational goal for most current
statistical technology. By contrast, in the latter case, emphasis must be clearly on computational
efficiency. The emphasis on parsimony in many contemporary books and papers is a further reflection
of the mind-set that implicitly focuses on small to moderate sample sizes, since few parameters do not
make sense in the context of very large sample sizes. Finally, we note that the very fact of largeness in
sample size implies that it is unlikely we would see iid homogeneity.
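
The contrast between the two regression problems can be made concrete with a rough cost
estimate. The sketch below is illustrative only and is not taken from the project's software.
It assumes dense storage in 8-byte floating point, a least squares fit by the normal equations,
and the usual leading-order operation counts (about n*p^2 to form X'X and p^3/3 to factor it).
The function name ols_cost and its dictionary keys are hypothetical.

```python
# Back-of-envelope cost of dense ordinary least squares via the normal equations.
# Assumptions (not from the report): 8-byte floats, about n*p^2 operations to
# form X'X, and about p^3/3 operations for a Cholesky factorization of X'X.

def ols_cost(n_obs, n_vars, bytes_per_value=8):
    """Return rough storage (bytes) and arithmetic (operations) for dense OLS."""
    design_bytes = n_obs * n_vars * bytes_per_value  # the n x p design matrix
    gram_bytes = n_vars * n_vars * bytes_per_value   # the p x p matrix X'X
    flops = n_obs * n_vars**2 + n_vars**3 // 3       # form X'X, then factor it
    return {"design_bytes": design_bytes, "gram_bytes": gram_bytes, "flops": flops}

# The two problems named in the text.
small = ols_cost(1_000, 5)
large = ols_cost(8_000_000, 40_000)

for name, cost in (("small", small), ("large", large)):
    print(f"{name}: design matrix {cost['design_bytes'] / 1e9:.6g} GB, "
          f"X'X {cost['gram_bytes'] / 1e9:.6g} GB, "
          f"about {cost['flops']:.3g} operations")
```

Under these assumptions the small problem needs 40,000 bytes for its design matrix, while the
large one needs about 2.56 terabytes for the design matrix and 12.8 gigabytes for X'X alone.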

Thus, a new perspective is required for large, high dimensional data sets. This project was
intended to explore a variety of mathematical and statistical theory related to this perspective.
Specifically, we focused on graphical methods for data representation in higher dimensions, structural
inference, and the computational issues related to these problems. We note that in a highly
multivariate setting, there is increased opportunity to focus on the functional relationship between
random variables (what we refer to as structural inference) rather than simply focusing on the
probability distribution of random variables (traditional statistical inference). This perspective has been
discussed in more detail in papers 12 and 17 listed below. We believe that all of these topics will
intertwine to form the conceptual core of our approach to this new perspective. The tasks we proposed
focus respectively on graphical issues (item 1), on structural inference issues (item 2) and on
computational issues (item 3).

The items listed below are publications produced under this AFOSR grant during the last two
years.

23. “Optimal estimation of transient signals via recursive splines,” with Hung T. Le, Center for
Computational Statistics Technical Report No. 40, George Mason University, 1989.

24. “Parallel computing and statistics,” Center for Computational Statistics Technical Report No. 41,

25. Hung Tri Le, “A Functional Analytic Approach to Transient Signal Detection and Estimation,”
Center for Computational Statistics Technical Report No. 45, George Mason University, 1989.

SOFTWARE

26. Mason Hypergraphics, copyright (c) 1988, 1989 by Edward J. Wegman, an MS-DOS package for
high-dimensional data analysis.

Items 7, 13, 16, 18 and 26 address graphical issues and detail a series of graphical devices we
have invented to assist in the development of high-dimensional data analysis. Items 5, 11, 14, 19, 22
and 24 have a primarily computational flavor and focus on improving algorithms. Items 6, 9, 10, 20,
21 and 23 focus on structural inference issues. Items 8, 12, 15 and 17 are contributions which are
general in nature.

In addition to the technical reports and papers, the project has produced a Ph.D. student,
Hung Tri Le, who completed his dissertation on the topic “A Functional Analytic Approach to
Transient Signal Detection and Estimation.” In addition, we have had two Master's students, Masood
Bolorforoush and Mingxian Xu. We believe that the project has been extraordinarily productive. We
are very grateful for the opportunity to work under Air Force sponsorship. It has allowed us not only
to develop a substantial amount of research, but also to develop our facilities and the technical skills
of some of our younger faculty and students. The related instrumentation funding has allowed us to
install an Intel iPSC/2 d4/VX message passing parallel computer as well as a Silicon Graphics IRIS
4D-120-GTB Graphics Workstation. Both of these facilities, as well as the faculty development, will pay
dividends for the Air Force and for the general scientific community well beyond the term of the
contract.


